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Abstract 

Complex disorders are typically characterized by multiple phenotypes. Analyzing these phenotypes jointly is expected to be 
more powerful than dealing with one of them at a time. A recent approach (O'Reilly et al. 2012) is to regress the genotype at 
a SNP marker on multiple phenotypes and apply the proportional odds model. In the current research, we introduce an 
explicit expression for the score test statistic and its non-centrality parameter that determines its power. Same simulation 
studies as those reported in Galesloot et al. (2014) were conducted to assess its performance. We demonstrate by 
theoretical arguments and simulation studies that, despite its potential usefulness for multiple phenotypes, the 
proportional odds model method can be less powerful than regular methods for univariate traits. We also introduce an 
implementation of the proposed score statistic in an R package named iGasso. 
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Introduction 

Complex human disorders are often characterized by multiple 
phenotypes. Some of them might be categorical while others might 
be continuous. For instance, patients with Bardet-Biedl syndrome 
often suffer from vision loss, hypertension and high cholesterol 
level caused by obesity, Polydactyly, and other abnormalities. In 
order to map the genetic variants underlying such disorders, it is 
highly desirable to analyze all available phenotypes simultaneous- 
ly. However, it is challenging to jointly model these phenotypes, 
especially when they are of different data types [1]. 

Let y denote a/cxl vector of k phenotypes on an individual 
and g his/her genotype at a marker. If all the components of y are 
continuous, one may use MANOVA given genetype g. When the 
components of y are of mixed data types, the choices are limited. 
One popular method is to first analyze each component 
individually and then combine the test statistics through a meta- 
analysis [2,3]. These methods model the phenotype vector y in 
terms of the genetic data g. 

For a single-nucleotide polymorphism (SNP), the distribution of its 
genotype g is trinomial. It is appealing to model g as a function of y 
[4,5]. Furthermore, since there is a natural ordering in the three 
genotypes at the SNP (assuming that the possibility of over- 
dominance is ignorable), one can use the ordinal logistic regression 
(a.k.a., the proportional odds model or the cumulative logit model) 
[6]. One immediate advantage of using the proportional odds model 
is that, unlike other methods, there is no need to make assumptions 
on the genetic effect such as additive, dominant, or recessive. The 
usefulness of this approach has been demonstrated via analyses of 



various data [5] . It is one of the best currently available multivariate 
methods [7] . However, it is the slowest one [7] . 

The test used in [5] is the likelihood ratio test (LRT). It involves 
numerical maximization under both the null hypothesis and the 
alternative hypothesis. We introduce a score test statistic using 
standard statistics theory. This statistic is asymptotically equivalent 
to the likelihood ratio test but computationally much faster due to 
the availability of its explicit expression, a feature useful in 
genome-wide association analysis. This explicit expression also 
gives insight on how the proportional odds model works in the 
context of genetic association analysis. 

This report is organized as follows. We first introduce an explicit 
form of the score statistic and its non-centrality parameter. The 
form of this score statistic provides some insights on the ability of 
this method to detect association. The performance of the this 
score statistic is evaluated using the same simulation scenarios used 
in [7] . Finally, we consider an important case where the phenotype 
vector y is univariate and binary to see how this test works for 
univariate phenotypes. 

Results 

The score statistic 

The genetic data are assumed to come from a biallelic marker 
such as a single-nucleotide polymorphism (SNP). Let a denote the 
reference allele and A the other. For simplicity, we use 0, 1, and 2 
to represent genotypes AA, Aa, and aa, respectively. Regardless of 
the data types of the components of y, the genotype g follows a 
trinomial distribution. In most cases, the effect of an allele is 
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monotonia That is, as the number of a alleles increases from 0 to 
2, the effect of genotypes AA, Aa, and aa is non-decreasing or 
non-increasing. Over-dominance effect exists but is rather rare. 
Given this consideration, we model the genotype g in terms of 
phenotype y using the proportional odds model [5]. 

Let Ji/(y) = Pr (g = j\y) denote the probability that an individ- 
ual's genotype g is j given phenotypic value y. In the current 
situation, the proportional odds model models the cumulative 
probabilities 7to(y) and 7trj(y) + Jti(y) joindy as follows: 



i°g(i2&)-«i+*. 



log 



fto(y) + fti(y) 
t2(y) 



--a 2 +P'y. 



(1) 



(2) 



Here oti and a 2 are intercepts and /? is a vector of coefficients. 
This model implies a 2 >o<i because 7io(y) + rci(y)>7io(y)- Since 
7I o(y) + 7I l(y)= 1 — ^(y), an alternative form of equation (2) is 



iog(^^W 2+ /f'y. 



?C2(y) 



Equation (1) models the effect of phenotype y on the odds of 
genotype AA versus Aa or aa while equation (2) models the effect 
of phenotype y on the odds of genotype aa versus Aa or AA. We 
have 



to(y)= 



exp («i+/7'y) 
l+exp( ai +jJ'y)' 



712(50 l + exp( a2 +/?'y)' 
and 7ti(y) is determined by 7ti(y)= 1 — 7Co(y) — rc 2 (y)- This model 
assumes that the difference of the left hand side of (1) or (2) for two 
phenotype vectors y! and y 2 depends only on — y 2 ) and is 
independent of genotype aa or A A: 



log 



rco(yi) 
i-rco(yi) 



log 



i-to(y 2 ) 



-PtSi-Yd, (3) 



= log 



i-^ 2 (yi) 
t2(yi) 



log 



i-^2(y 2 ) 
^(y 2 ) 



(4) 



The hypotheses of interest are 

H 0 :/? = 0, //1 :y?#0. 



(5) 



These hypotheses can be tested by the likelihood ratio statistic. 
To introduce the score statistic, define 



\i:g,-=0 i:g,#0 y \<:g,=2 i:g ! -#2 / 

where So and 7t 2 are the sample proportions of the genotypes for 
which g = 0 and 2, respectively, w is the difference of two weighted 
summations of y,. The summation in the first pair of parentheses 
weights y, with g, = 0 more than other y,s (i.e., 1 versus 7t 2 ) while the 
summation in the second pair of parentheses weights y, with g t = 2 
more (i.e., 1 versus So). Let Ti\ = 1 — %q — 7t 2 . It is shown in the 
Methods section that a score statistic for testing hypotheses (5) is 



1 



where 



n(l — 7io)(l — — 712) 



w'V-'w, 



v=«- 1 £y,'y ! -n- 1 £y,'-«- 1 Ey- 
1=1 i=i i=i 

is the sample variance matrix of y,, i = 1, ...,«. The non-centrality 
parameter of S is 



NCP=- 



£(w)'V- 1 £(w) 



That is, 



«(l-7Co)(l-7Ci)(l-7t 2 )' 
where the expectation in _E(w) is taken under the alternative. This 
NCP can be used to compute power at significance level a. in the 
following way: 

>'><\ ■/' ,,> 

where X follows a chi-square distribution with df = k and non- 
centrality parameter NCP and Xi-xk ls me critical value from a 
chi-square distribution with df = k and non-centrality parameter 0. 

Simulation Study 

The simulation study consists of two parts. In the first part, 
multivariate phenotypes were simulated the same way as in [7]. 
Genotype data at a single SNP were simulated with minor allele 
frequency q under the assumption Hardy-Weinberg equilibrium. 
Three phenotypes, denoted by y\ ,72, and yj, were simulated using 
the following relationship: 



to(yi)t2(yi) 



rco(y 2 )rc 2 (y 2 ) 



(1 -TtoCyOXl -^(yi)) (1 - Jto(y 2 )Xl -^(y 2 )) 
= exp(o£! —0:2) 

does not depend on the value of y. 

Let i be the index for the (th individual in a sample of size n, the 
log-likelihood function is 





( 


rG-a\ 




MVN 




a 2 






V 


"1 





1-A? 



Symmetry 




i-hi 



/(ai,a 2 ,/?;{y,})= J2 J2 lo 8 (^(y,-))- 

J =0,1,2 i:g,-=7 



where ri?i 2 , ri?i3, and r£ 2 3 are pre-spedfied residual correla- 
tions between phenotypes excluding the QTL effect, rG= 1 or — 1 
controls the effect direction of y\ (those ofj> 2 and y$ axe fixed at 1) 

and Oj = yjhj /2(l —q)q. Three scenarios were considered with 
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only 1, 2, or all phenotypes were associated with the SNP. See [7] 
for more details. For each simulated data set, the LRT />-value is 
obtained from the R package MultiPhen and the score p-value is 
obtained from the R package iGasso. Empirical rejection rates 
over 1 ,000 replicates are reported in Table 1 . The performance of 
the score test is very similar to that of LRT. The empirical power 
are very close to the power of LRT reported in [7] . 

Is the proportional odds model always more powerful than the 
usual tests of association? To address this question, simulation 
were conducted on univariate phenotypes. The description of the 
simulation studies is provided in the Methods section. In addition 
to the proposed score statistic, the other test statistics used in the 
simulation include the Pearson's chi-square test, the Cochran- 
Armitage trend test, and the likelihood ratio test for the 
proportional odds model. The number of simulation replicates is 
fixed at 10,000. The sample size is 2,000. Half of the subjects are 
cases and half are controls. The simulated type I error rate for all 
these statistics is reported in Table 2. The empirical rejection rates 
are very close to their nominal levels, which are 0.1, 0.01, and 
0.005. The simulated power is presented in Figures 1, 2, and 3. It 
is striking to see that for recessive models the proportional odds 
model is the least powerful. For instance, when prevalence AT = 0.1 
and minor allele frequency p = 0.3, the power are 0.486 for Chi- 
Square test, 0.353 for Trend test, and 0.19 for both LRT and 
Score test. For other two models, there are situations it is more 
powerful than other methods. The simulated power for the S 
statistic is in line with the calculated power reported in Table 3. 



assumption and adopt a multinomial logistic regression. For our 
simulation studies, the multinomial logistic regression would be 
equivalent to the Pearson's chi-square test statistic. There are quite 
few implementations of the multinomial logistic regression, for 
instance, the multinom function in R package nnet. 

Methods 

Derivation of the score statistic 

The first-order derivatives of the log-likelihood function 
l{oi\,OL2,P) are 

= ^ (1 -*„&,■))- ^ r 



>«/=0 



-7io(y,-)-7t2(y,-)' 



1^ i r^v 2^ ( 1_77: 2(y,)), 



dl 



(i-Ko(y,))y,+ 

i: gj =0 

E ( 7t 2(y,)-Ko(y,))y / - ( 1_7C 2(y,-))y,-, 



Discussion 



and the second-order derivatives are 



In this report, we introduced a score test statistic for the 
proportional odds model for testing the association between a SNP 
and multiple phenotypes and provided an implementation of this 
statistic. Same simulation studies as those reported in [7] were 
conducted to assess it performance. We also did simulation analyses 
to study the performance of proportional odds model for univariate 
phenotypes which is covered by [5]. Although appealing to studies 
on multiple phenotypes, this method may be less powerful for 
univariate traits than regular methods. For case-control data, our 
results suggest that the traditional Pearson's chi-square test and the 
Cochran- Armitage trend test are preferred when the disease allele 
frequency is less than 0.5 and the disease is recessive. 

Nonetheless, the proportional odds model method provides a 
convenient way for analyzing multiple phenotypes, especially 
when these phenotypes are of different types [5]. If the 
proportional odds assumption is of concern, one may remove this 



E 



J2 "oOfrXl-lofri)) 



i; gi = 0 

7Co(y i )(l-7c 0 (y i ))(l-27 C o(y i )) , 7i 0 (y,) 2 (l -«b(y,)) 2 ' 



rci(y,) 



*i(y,r 



Table 1. Non-centrality value and the associated power (presented in parentheses) for models used in the simulation studies. 





K 


Frequency of Allele a 


Effect of Allele a 






P 


Recessive 


Additive 


Dominant 


0.01 


0.1 




4.6780 (0.2597) 


14.4110 (0.8387) 




0.3 


2.8697 (0.1329) 


9.7282 (0.6225) 


18.5977 (0.9339) 




0.4 


6.9507 (0.4323) 






0.1 


0.1 




5.7000 (0.3374) 


17.6383 (0.9182) 




0.3 


3.4771 (0.1730) 


1 1 .7847 (0.7343) 


22.5123 (0.9737) 




0.4 


8.4233 (0.5379) 







The relative genotype risks are/] // 0 = l,/ 2 //o = 1-5 for recessive models; f\/fo = 1.5,7i//o = 1.5 for dominant models; and fi/fo = 1.25,/ 2 //o = 1.5 for additive models. 
doi:1 0.1 371 /joumal.pone.01 0691 8.t001 
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Table 2. Simulated type I error rate in the case of a univariate phenotype under various generating models. 





Penetrance (A) Frequency of Allele a {p) 


Significance Level 


Test Statistic 












Trend 


Chi-Square 


LRT 


Score 


0.01 0.1 


0.1 


0.085 


0.095 


0.091 


0.091 




0.01 


0.007 


0.007 


0.006 


0.007 




0.005 


0.005 


0.004 


0.004 


0.004 


0.3 


0.1 


0.077 


0.091 


0.082 


0.082 




0.01 


0.009 


0.008 


0.010 


0.010 




0.005 


0.006 


0.002 


0.005 


0.005 


0.1 0.1 


0.1 


0.096 


0.099 


0.101 


0.102 




0.01 


0.011 


0.009 


0.010 


0.010 




0.005 


0.006 


0.006 


0.007 


0.007 


0.3 


0.1 


0.116 


0.117 


0.109 


0.108 




0.01 


0.009 


0.011 


0.010 


0.010 




0.005 


0.006 


0.007 


0.003 


0.003 


0.01 0.1 


0.1 


0.085 


0.095 


0.091 


0.091 




0.01 


0.007 


0.007 


0.006 


0.007 




0.005 


0.005 


0.004 


0.004 


0.004 


0.3 


0.1 


0.077 


0.091 


0.082 


0.082 




0.01 


0.009 


0.008 


0.010 


0.010 




0.005 


0.006 


0.002 


0.005 


0.005 


0.1 0.1 


0.1 


0.096 


0.099 


0.101 


0.102 




0.01 


0.011 


0.009 


0.010 


0.010 




0.005 


0.006 


0.006 


0.007 


0.007 


0.3 


0.1 


0.116 


0.117 


0.109 


0.108 




0.01 


0.009 


0.011 


0.010 


0.010 




0.005 


0.006 


0.007 


0.003 


0.003 


0.01 0.1 


0.1 


0.085 


0.095 


0.091 


0.091 




0.01 


0.007 


0.007 


0.006 


0.007 




0.005 


0.005 


0.004 


0.004 


0.004 


0.3 


0.1 


0.077 


0.091 


0.082 


0.082 




0.01 


0.009 


0.008 


0.010 


0.010 




0.005 


0.006 


0.002 


0.005 


0.005 


0.1 0.1 


0.1 


0.096 


0.099 


0.101 


0.102 




0.01 


0.011 


0.009 


0.010 


0.010 




0.005 


0.006 


0.006 


0.007 


0.007 


0.3 


0.1 


0.116 


0.117 


0.109 


0.108 




0.01 


0.009 


0.011 


0.010 


0.010 




0.005 


0.006 


0.007 


0.003 


0.003 



The test statistics are: Trend — Cochran-Armitage trend test; Chi-Square — Pearson's chi-square test; LRT — the likelihood ratio test for the proportional odds model 
computed by using the polr function in the R package MASS; Score — the proposed score statistic computed by using the SNPass.test function in the R package iGasso. 
doi:1 0.1 371 /journal.pone.01 0691 8.t002 
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X ^ofriXl -7r 0 (y,))y', 



1 t-Si = i 



■ X 7r 2(y/)(l-m(y/)), 
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oi=- Yl re o(y,)(i — ^(y^y/y' — E ^(y.Xi-^Cy/My/y'- 
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Under Hq : fi = 0, Kj(y t )j = Q, 1, 2 no longer depends on y,-. So 
their values are simply denoted by 7To,7ii, and 712, respectively. Let 
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Chi-Square 
LRT 
Score 
Trend 



! 

Q- 



0.8 



0.6 



0.4 



0.2 



K = 0.1 



- 0.8 



0.6 



0.4 



0.2 



K = 0.01 



p = 0.3 



p = 0.4 



Figure 1. Simulated power for recessive model. The relative genotype risks are fi/fo = \,fi/fa = 1-5. /(represents disease prevalence and p is 
the frequency of allele a. The abbreviations for the test statistics are the same as in Table 2. 
doi:1 0.1 371 /journal.pone.01 0691 8.g001 



a = (tx.\ ,ct 2 )' . The expected Fisher information matrix evaluated at 
H 0 :fi = 0h 



( E{8 2 l/8a 2 ) 
E{8 2 lj8d.\8a 2 ) 
E(d 2 l/8ai8P) 

V 



E(8 2 l/8a.idv. 2 ) E(8 2 l/8ai8fi')\ 

E(8 2 l/84) E(8 2 l/8oc 2 8p') 
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Chi-Square 
LRT 
Score 
Trend 



! 

Q- 



0.8 - 



0.6 - 



0.4 - 



0.2 - 



K = 0.1 



- 0.8 



0.6 



0.4 



- 0.2 



K = 0.01 



p = 0.1 



p = 0.3 



Figure 2. Simulated power for additive model. The relative genotype risks are /i//o = l -25,/?//o = 1.5. K represents disease prevalence and p is 
the frequency of allele a. The abbreviations for the test statistics are the same as in Table 2. 
doi:1 0.1 371 /journal.pone.01 0691 8.g002 



where the matrix partition is in an obvious manner. By standard where W is 81/ 8/i evaluated at Hq: 

asymptotic theory, the score statistic is 

w=(i-7to) Y yi+fe-rco) Y y/-( 1_7c 2) Z y ; 

5'=w'[%-4 /( I M 1 I^] W i:g; = 0 /:?,- = 1 i m = 2 

= n(i-7to)(i-7ri)(i-7i 2 ) w ' y ' w ' = ( Y y' +ni - Z y- ) - ( Z y < +7r ° Z y * ) • 

V'«i = 0 itf/^O / \i: gi = 2 r. gj ^2 J 



PLOS ONE | www.plosone.org 



6 



September 2014 | Volume 9 | Issue 9 | e106918 



Testing Association with Multiple Phenotypes 



Chi-Square 
LRT 
Score 
Trend 



K = 0.1 



- O.f 



- 0.6 



- 0.4 



! 

Q- 



0.8 - 



- 0.2 



K = 0.01 



0.6 - 



0.4 - 



0.2 - 



p = 0.1 



p = 0.3 



Figure 3. Simulated power for dominant model. The relative genotype risks are f\/f 0 = 1.5,/ 2 //o = 1.5. K represents disease prevalence and p is 
the frequency of allele a. The abbreviations for the test statistics are the same as in Table 2. 
doi:1 0.1 371 /journal.pone.01 0691 8.g003 



The unknown values of no, n\, and 312 are estimated by their 
sample genotype proportions, respectively. 

Simulation Studies 

Here is a description of the simulation procedure for the case of 
a dichotomous phenotype. Suppose the trait is Mendelian. Let po, 
P\, and P2 denote the frequencies of genotypes AA, Aa and aa in 
general population and fo, f\, and fx their penetrances, 
respectively. The prevalence of the disease would be K=pofo 
+Plfi +Pzf2- The genotype frequencies in cases are n\j=Pjf]/ 
K,j = 0,1,2, and in controls are noj=Pj(l— fj)/ (I— K),j = 0,1, 2. 
In this situation, the variance of y is (j>(\—(j)) where ^ is the 
proportion of cases. The non-centrality parameter (NCP) of test 
statistic S is equal to 



B(l-JH))(l-»Ci)(l-7t2MW) 
_ wfll-fl \pi(K-f 2 ) - Po (K-f 0 )+ Po p 2 (f2-fo)] 2 

[K(\-K)} 2 (l-7C0)(l-7Cl)(l-7C 2 ) 

Let p be the population frequency of allele a. Assuming Hardy- 
Weinberg equilibrium in the population, the frequencies of 
genotypes AA, Aa, and aa are po = (1 — p) 2 ,Pi =2p(l — p), and 
P2=p 2 , respectively. Let }', =/i//o> i=l,2, be the relative risk of 
genotype i to genotype 0. A data generating model is completely 
determined by K, p, y l , and y 2 . This is because 
fo=K/(po + y i p\+y 2 P2), fi=Yifo, an d f2 = y%fa- Hence the 
genotype frequencies in cases and controls are determined and 
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data can be simulated. We consider a dominance model (yj = y 2 ), a 
recessive model (y l = 1), and an additive model (yj =(1 +y 2 )/2). 
The NCPs for the models used in simulation are reported in 
Table 3. So are the power associated with these NCPs. 

R function SNPass.test 

The R function SNPass.test in the package iGasso implements 
the proposed score statistic. R users can download and install 
iGasso from CRAN (http:/ /cran.r-project.org/) or any CRAN 
mirror. 
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